Norco
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Haas, Lukas, Yona, Gal, D'Antonio, Giovanni, Goldshtein, Sasha, Das, Dipanjan
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
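The F1-score quoted above can be illustrated with a minimal sketch. It assumes the benchmark keeps the F1 definition of the original SimpleQA, the harmonic mean of overall accuracy and accuracy on attempted questions; the abstract does not confirm this. The function name `simpleqa_f1` and the grade labels are hypothetical and are not taken from the benchmark's evaluation code.

```python
from collections import Counter


def simpleqa_f1(grades):
    """Harmonic mean of overall accuracy and accuracy on attempted questions.

    `grades` is a list of strings, each one of "correct", "incorrect",
    or "not_attempted" (hypothetical labels chosen for this sketch).
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    # Avoid division by zero when nothing was attempted or nothing was correct.
    if total == 0 or attempted == 0 or counts["correct"] == 0:
        return 0.0
    overall_correct = counts["correct"] / total
    correct_given_attempted = counts["correct"] / attempted
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Toy grades, not benchmark data: 6 correct, 2 incorrect, 2 not attempted.
toy_grades = ["correct"] * 6 + ["incorrect"] * 2 + ["not_attempted"] * 2
toy_score = simpleqa_f1(toy_grades)
```

Under this definition, a model that declines to answer when unsure raises its accuracy on attempted questions but lowers its overall accuracy, so the harmonic mean rewards a balance of the two.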
- North America > United States > California > San Francisco County > San Francisco (0.14)
- South America > Colombia (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Leisure & Entertainment (1.00)
- Government (0.69)
- Media > Television (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)
Do Bayesian Neural Networks Improve Weapon System Predictive Maintenance?
This approach lacks the extra information on individual weapon system characteristics. A recent method introduced the Weibull-Cox Bayesian Neural Network tested on several weapon systems, albeit requiring a held-out validation set [7]. Moreover, while understanding the population reliability trends via a Weibull distribution is informative, this formulation does ...
... systems with interval-censored data and time-varying covariates. We analyze and benchmark our approach, LaplaceNN, on synthetic and real datasets with standard classification metrics such as Receiver Operating Characteristic (ROC) Area Under Curve (AUC), Precision-Recall (PR) AUC, and reliability curve visualizations.
- North America > United States > California > Riverside County > Norco (0.05)
- Asia > Pakistan (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
Bayesian Weapon System Reliability Modeling with Cox-Weibull Neural Network
We propose to integrate weapon system features (such as weapon system manufacturer, deployment time and location, storage time and location, etc.) into a parameterized Cox-Weibull [1] reliability model via a neural network, like DeepSurv [2], to improve predictive maintenance. In parallel, we develop an alternative Bayesian model by parameterizing the Weibull parameters with a neural network and employing dropout methods such as Monte-Carlo (MC)-dropout for comparative purposes. Due to data collection procedures in weapon system testing, we employ a novel interval-censored log-likelihood which incorporates Markov Chain Monte Carlo (MCMC) [3] sampling of the Weibull parameters during gradient descent optimization. We compare classification metrics such as receiver operating characteristic (ROC) area under the curve (AUC), precision-recall (PR) AUC, and F scores to show that our model generally outperforms strong traditional models such as XGBoost and the current standard conditional Weibull probability density estimation model.
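For orientation, the textbook form of an interval-censored Weibull log-likelihood can be sketched as follows. This is not the authors' novel likelihood: it omits the neural-network parameterization and the MCMC sampling described above, and the names `weibull_survival` and `interval_censored_loglik` are hypothetical. It assumes each unit is only known to have failed somewhere in an inspection interval (left, right], with right set to infinity when no failure has been observed yet.

```python
import math


def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    if math.isinf(t):
        return 0.0
    return math.exp(-((t / scale) ** shape))


def interval_censored_loglik(intervals, shape, scale):
    """Sum of log(S(left) - S(right)) over (left, right] failure intervals.

    A right endpoint of math.inf marks a right-censored unit, whose
    contribution reduces to log S(left).
    """
    total = 0.0
    for left, right in intervals:
        prob = (weibull_survival(left, shape, scale)
                - weibull_survival(right, shape, scale))
        # Floor the probability so the log stays finite for tiny intervals.
        total += math.log(max(prob, 1e-300))
    return total


# Toy intervals, not real test data: one failure found between
# inspections at t=0 and t=1, one unit still working at t=1.
toy_intervals = [(0.0, 1.0), (1.0, math.inf)]
toy_loglik = interval_censored_loglik(toy_intervals, shape=1.0, scale=1.0)
```

In a model like the one described, the shape and scale would be produced per system by the network rather than fixed, and the negative of this quantity would serve as the training loss.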
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > Riverside County > Norco (0.04)
- North America > United States > New York (0.04)
- Health & Medicine (1.00)
- Government > Military (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.35)